emLam - a Hungarian Language Modeling baseline

Author

  • Dávid Márk Nemeskey
Abstract

This paper aims to make up for the lack of documented baselines for Hungarian language modeling. Various approaches are evaluated on three publicly available Hungarian corpora. Perplexity values comparable to models of similar-sized English corpora are reported. A new, freely downloadable Hungarian benchmark corpus is introduced.

Similar resources

Finite-state Transducer Base with Explicit Modeling of Ph

This article describes the design and the experimental evaluation of the first Hungarian large vocabulary continuous speech recognition (LVCSR) system. The architecture of the recognition system is based on the recently proposed weighted finite state transducer (WFST) paradigm. The task domain is the recognition of fluently read sentences selected from a major daily newspaper. Recognition perfo...

Explicit Modeling of Phonological Changes in Finite-state Transducer Based Hungarian LVCSR

This article describes the operation and the experimental evaluation of the pronunciation modeling component of the first Hungarian large vocabulary continuous speech recognition system. The proposed method is based on the implementation of context dependent rewrite rules by weighted finite state transducers (WFSTs). The proposed phonological model decreases the error rate by 8.32% relatively c...

Finite-state transducer based Hungarian LVCSR with explicit modeling of phonological changes

This article describes the design and the experimental evaluation of the first Hungarian large vocabulary continuous speech recognition (LVCSR) system. The architecture of the recognition system is based on the recently proposed weighted finite state transducer (WFST) paradigm. The task domain is the recognition of fluently read sentences selected from a major daily newspaper. Recognition perfo...

Finite-state Transducer Based Phonology and Morphology Modeling with Applications to Hungarian LVCSR

This article introduces a novel approach to model phonology and morphosyntax in morpheme unit based speech recognizers. The proposed method is evaluated in our recent Hungarian large vocabulary continuous speech recognition (LVCSR) system. The architecture of the recognition system is based on the weighted finite state transducer (WFST) paradigm. The task domain is the recognition of fluently r...

Towards Automatic Transcription of Large Spoken Archives in Agglutinating Languages - Hungarian ASR for the MALACH Project

The paper describes automatic speech recognition experiments and results on the spontaneous Hungarian MALACH speech corpus. A novel morph-based lexical modeling approach is compared to the traditional word-based one and to another, previously best performing morph-based one in terms of word and letter error rates. The applied language and acoustic modeling techniques are also detailed. Using uns...

Journal:
  • CoRR

Volume: abs/1701.07880

Publication date: 2017